conda install pandas numpy scikit-learn matplotlib jupyter altair seaborn python-graphviz

`conda list` will show you a list of currently installed packages.

import pandas as pd
import numpy as np
from PIL import Image
import sys
sys.path.append('code/')
from toy_classifier import classify_image
img = Image.open("img/apple.jpg")
img
classify_image(img, 5)
Are the following supervised or unsupervised problems?
Are the following classification or regression problems?
We can use `read_csv()` to import a toy dataset:

df = pd.read_csv('data/cities_USA.csv', index_col=0).sample(6, random_state=100)
df
Each city is labelled red or blue depending on how it voted in the 2012 election.

Of particular note:
df
df.shape
We can first split the data at Feature_1 = 0.47. For observations with Feature_1 > 0.47, we can split again using Feature_2 = 0.52:

- Feature_1 > 0.47 and Feature_2 < 0.52: classify as **BLUE**
- Feature_1 > 0.47 and Feature_2 > 0.52: classify as **ORANGE**

For observations with Feature_1 < 0.47, we can split using Feature_2 = 0.6. Putting all the splits together as if statements:

- Feature_1 < 0.47 and Feature_2 < 0.6: classify as **ORANGE**
- Feature_1 < 0.47 and Feature_2 > 0.6: classify as **BLUE**
- Feature_1 > 0.47 and Feature_2 < 0.52: classify as **BLUE**
- Feature_1 > 0.47 and Feature_2 > 0.52: classify as **ORANGE**

"Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities." Source: scikit-learn website
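The hand-built splits above can be written out as a plain Python function. This is just a sketch: the function name `classify_city` is made up for illustration, and the thresholds (0.47, 0.52, 0.6) are the ones read off the plot above.

```python
# A hand-built "decision tree" as nested if statements.
# Thresholds are the ones chosen by eye above.
def classify_city(feature_1, feature_2):
    if feature_1 < 0.47:
        if feature_2 < 0.6:
            return "ORANGE"
        else:
            return "BLUE"
    else:
        if feature_2 < 0.52:
            return "BLUE"
        else:
            return "ORANGE"

print(classify_city(0.3, 0.7))  # BLUE
print(classify_city(0.5, 0.9))  # ORANGE
```

A decision tree algorithm effectively learns a nested structure of if statements like this automatically from the data.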
conda install scikit-learn
scikit-learn algorithms are generally imported as `from sklearn.module import algorithm`. The decision tree classifier (`DecisionTreeClassifier`) sits within the `tree` module:

from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=2, random_state=1)
model
df
X = df.drop(columns=['vote'])
y = df[['vote']]
X
y
We use the `.fit()` method to train our model using the X and y data:

model.fit(X, y)
import graphviz
from sklearn.tree import export_graphviz
dot_data = export_graphviz(model)
graphviz.Source(export_graphviz(model,
out_file=None,
feature_names=X.columns,
class_names=["blue", "red"],
impurity=False))
I'm using altair to make this plot (which you may not have seen). It makes very nice plots but requires some wrangling to get data into a suitable format for use with the package. You do not need to learn to use Altair in this course; all your plotting for this course may be done in matplotlib.

import altair as alt  # altair is a plotting library
import sys
sys.path.append('code/')
from model_plotting import plot_model, plot_regression_model, plot_tree_grid # these are some custom plotting scripts I made
plot_model(X, y, model)
We can make predictions using the `.predict()` method of our model, passing in X:

model.predict(X)
`.to_numpy()` simply changes a dataframe to a numpy array, and `np.squeeze()` squeezes the result to a 1d array. The only reason I'm using these commands is so we can easily compare the output to the output of `.predict()` above.

np.squeeze(y.to_numpy())
We can calculate the accuracy of our model using the `.score()` method:

model.score(X, y)
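For a classifier, `.score()` returns the accuracy, i.e., the fraction of correct predictions. A minimal sketch of this equivalence, using a tiny synthetic dataset (not the cities data) so it runs standalone:

```python
# Sketch: a classifier's .score() equals the fraction of correct predictions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X_toy = np.array([[0], [1], [2], [3]])
y_toy = np.array([0, 0, 1, 1])
toy_model = DecisionTreeClassifier(max_depth=1).fit(X_toy, y_toy)

# fraction of predictions matching the true labels
accuracy = (toy_model.predict(X_toy) == y_toy).mean()
print(accuracy == toy_model.score(X_toy, y_toy))  # True
```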
The `np.atleast_2d()` function ensures an array is 2D:

model.predict(np.atleast_2d([-90, 40]))
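This matters because scikit-learn's `predict()` expects a 2D array of shape `(n_samples, n_features)`. A quick check of what `np.atleast_2d()` does to a plain list:

```python
# np.atleast_2d turns a 1-D list into a 2-D row vector:
# one sample with two features, which is what predict() expects.
import numpy as np

a = np.atleast_2d([-90, 40])
print(a.shape)  # (1, 2)
```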
plot_model(X, y, model)
When we call `fit`, the ML algorithm is learning a bunch of values. Before we `fit` on a specific data set, we can set some "knobs" that control the learning; these are called hyperparameters. One example is `max_depth`:

model = DecisionTreeClassifier(max_depth=1).fit(X, y)  # a max_depth of 1 is called a "Decision Stump"
dot_data = export_graphviz(model)
graphviz.Source(export_graphviz(model,
out_file=None,
feature_names=X.columns,
class_names=["blue", "red"],
impurity=False))
plot_model(X, y, model)
model.score(X, y)
Let's try other `max_depth` values, for example `max_depth = 5`:

model = DecisionTreeClassifier(max_depth=5).fit(X, y)
dot_data = export_graphviz(model)
graphviz.Source(export_graphviz(model,
out_file=None,
feature_names=X.columns,
class_names=["blue", "red"],
impurity=False))
plot_model(X, y, model)
model.score(X, y)
To summarize this section:
from sklearn.tree import DecisionTreeRegressor
X = np.atleast_2d(np.linspace(-7, 6.5, 60)).T
y = (np.sin(X.T**2/5) + np.random.randn(60)*0.1).T
Let's start with `max_depth=1`:

max_depth = 1  # hyperparameter for tree model
model = DecisionTreeRegressor(max_depth=max_depth).fit(X, y)
plot_regression_model(X, y, model)
What happens as we increase `max_depth`?

plot_tree_grid(X, y, max_depth=[1, 3, 5, 10])
A lower `max_depth` smooths over the scatter in our data; a higher `max_depth` fits more closely to the scatter in our data.

df = pd.read_csv('data/cities_USA.csv', index_col=0)
Your tasks:
1. How many features are in this dataset?
2. How many observations are in this dataset?
3. Fit decision tree models with different `max_depth` values based on this data.
4. Visualize each model using the `plot_model()` code (or some other method).
5. Which `max_depth` value would you choose to predict this data? Would you choose the same `max_depth` value to predict new data?
6. Do you think most of the computational effort takes place at the `.fit()` stage or `.predict()` stage?

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
import sys
sys.path.append('code/')
from model_plotting import plot_model
df = pd.read_csv('data/cities_USA.csv', index_col=0)
# 1
print(f"There are {df.shape[1]-1} features and 1 target.")
# 2
print(f"There are {df.shape[0]} observations.")
# 3/4/5
X = df.drop(columns='vote')
y = df[['vote']]
for max_depth in [1, 5, 10]:
model = DecisionTreeClassifier(max_depth=max_depth).fit(X, y)
print(f"For max_depth={max_depth}, accuracy={model.score(X, y):.2f}.")
display(plot_model(X, y, model))
# 6
# Most of the computational effort takes place in the .fit() stage, when we create the model.
How do we choose `max_depth` (or other hyperparameters)? Why not just use a very large `max_depth` for every supervised learning problem and get super high accuracy?
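As a teaser for why a very large `max_depth` isn't a free lunch, here is a sketch (on made-up data, with names like `X_noise` invented for illustration): an unrestricted tree can memorize labels that are pure noise, so perfect training accuracy tells us nothing about how the model will do on new data.

```python
# A deep tree can perfectly memorize random labels.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X_noise = rng.rand(50, 2)            # 50 random 2-feature observations
y_noise = rng.randint(0, 2, 50)      # labels are pure noise

deep_model = DecisionTreeClassifier(max_depth=None).fit(X_noise, y_noise)
print(deep_model.score(X_noise, y_noise))  # 1.0 on the training data
```

Scoring a model on the same data it was trained on is exactly the trap we'll untangle next.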